[V1] Improve `AsyncLLM` and API Server #11237

robertgshaw2-neuralmagic · 2024-12-16T16:53:40Z

NOTE: I am in the process of splitting this PR into separate pieces.

SUMMARY:

Improve API server performance by making the add_request_id middleware optional
Improve AsyncLLM performance with a 3-process architecture and reduced task switching in AsyncLLM
Better error handling with SIGQUIT
Reduce usage of zmq sockets by switching readiness probes to mp.Pipe
Avoid footguns with ulimit
Add abstractions for background process handling
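
For readers unfamiliar with the readiness-probe item above, here is a minimal sketch of how a parent process can wait on an `mp.Pipe` instead of a zmq socket. It is not the code from this PR: `background_worker`, `start_and_wait`, the `{"status": "READY"}` payload, and the timeout are all hypothetical names chosen for illustration, and the sketch assumes a POSIX host where the `fork` start method is available.

```python
import multiprocessing as mp


def background_worker(ready_writer):
    """Hypothetical background process: initialize, then report readiness."""
    # Expensive startup work (model loading, socket setup, ...) would go here.
    ready_writer.send({"status": "READY"})
    ready_writer.close()


def start_and_wait(timeout_s: float = 10.0) -> bool:
    """Start the worker and block until it reports ready or the timeout hits."""
    # Assumption: POSIX host. "fork" lets the sketch run without a __main__ guard.
    ctx = mp.get_context("fork")
    # duplex=False returns a one-way pipe as (reader, writer).
    reader, writer = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=background_worker, args=(writer,))
    proc.start()
    # Drop the parent's copy of the write end so only the child holds it.
    writer.close()
    try:
        # poll() returns False if nothing arrives within timeout_s seconds.
        if not reader.poll(timeout_s):
            return False
        return reader.recv()["status"] == "READY"
    except EOFError:
        # The child exited before sending anything.
        return False
    finally:
        reader.close()
        proc.join(timeout=timeout_s)
```

Calling `start_and_wait()` returns `True` once the child has sent its message. A pipe needs no port or socket path, which is presumably part of the appeal over a dedicated zmq socket for a one-shot startup signal.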

TODO:

Re-enable profiling
Make AsyncLLM shutdown cleaner (raises an error in the stack trace)
More testing

github-actions · 2024-12-16T16:53:51Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

mergify · 2024-12-23T23:06:25Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mgoin · 2024-12-24T03:21:45Z

vllm/v1/engine/async_llm.py

+            for out in outputs:
+                # Note: it is possible that a request was aborted
+                # due to client cancellation while EngineCoreOutputs
+                # are still flowing, so we just ignore.
+                if out.request_id in self.rid_to_queue:
+                    self.rid_to_queue[out.request_id].put_nowait(out)


Do you think this isn't worth logging since we are sure the request was aborted?
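
To make the trade-off in this question concrete, here is a small self-contained sketch of the same dispatch loop with a debug-level log line for dropped outputs. It is an illustration only, not the PR's code: `Output`, `dispatch`, and `demo` are hypothetical stand-ins, and only the `rid_to_queue` name and the `put_nowait` call come from the hunk above.

```python
import asyncio
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class Output:
    """Hypothetical stand-in for a single engine output."""
    request_id: str
    text: str


def dispatch(outputs, rid_to_queue):
    """Route each output to its request queue; return the IDs that were skipped."""
    skipped = []
    for out in outputs:
        queue = rid_to_queue.get(out.request_id)
        if queue is None:
            # The request was already aborted (e.g. the client disconnected),
            # so there is nobody left to consume this output.
            logger.debug("Dropping output for aborted request %s", out.request_id)
            skipped.append(out.request_id)
            continue
        queue.put_nowait(out)
    return skipped


async def demo():
    # "req-a" is still live; "req-b" was aborted and has no queue anymore.
    rid_to_queue = {"req-a": asyncio.Queue()}
    outputs = [Output("req-a", "hello"), Output("req-b", "late")]
    skipped = dispatch(outputs, rid_to_queue)
    delivered = await rid_to_queue["req-a"].get()
    return delivered.text, skipped
```

Logging at debug level keeps the hot path quiet in production while still leaving a trace when investigating cancellations. Whether that is worth the extra branch is the judgment call raised here.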

vllm/v1/engine/core.py

vllm/v1/utils.py

Co-authored-by: Michael Goin <[email protected]>

rickyyx

With the HTTP middleware issue fixed, I'm curious how much of a performance difference the new 3-process architecture yields.

mergify · 2024-12-26T01:41:00Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

robertgshaw2-neuralmagic · 2024-12-31T15:33:07Z

Closing as complete

robertgshaw2-neuralmagic added 5 commits December 16, 2024 03:42

first rev of 3 process architecture

c2ad07c

finally able to generate text

f0b3e36

breaking under load

ce8aa2c

working e2e

457d618

workign e2e

c980dbd

robertgshaw2-neuralmagic requested review from WoosukKwon, njhill, ywang96, comaniac and alexm-neuralmagic as code owners December 16, 2024 16:53

mergify bot added the frontend label Dec 16, 2024

robertgshaw2-neuralmagic added 16 commits December 16, 2024 19:06

stash

cba2d54

stash

3ae44a8

remove async stream

3ef5687

fix protocol

b350084

clean up completion client

abd7fa3

stash

6986457

updated

816e965

updated comment

cebf287

remove comptibility

adcc3d2

format

4344f1b

format/comments

d7b42a0

update comment

c987a76

format

f3ff0e0

updated examples

fbf647f

more cleaning

b1105b9

make pr smaller

ea7289b

robertgshaw2-neuralmagic changed the title ~~workign e2e~~ [V1] API Server Performance Dec 16, 2024

updated

06dcb1b

fixup

cbc043e

git

mergify bot added the needs-rebase label Dec 23, 2024

robertgshaw2-neuralmagic added 4 commits December 23, 2024 23:16

mypy

8061078

stash

8372665

almost there with llm engine

6b4f2bb

format'

db7d055

mgoin reviewed Dec 24, 2024

robertgshaw2-neuralmagic and others added 5 commits December 24, 2024 14:28

clean

98053d6

updated

4713e29

nit

4f946eb

Update vllm/v1/utils.py

59c6430

Co-authored-by: Michael Goin <[email protected]>

Merge branch 'main' into remove-async-stream

9dceec4

mergify bot removed the needs-rebase label Dec 24, 2024

robertgshaw2-neuralmagic added 4 commits December 24, 2024 14:37

updated

856838d

updated

94fe4af

stash

127045a

remove log

1352386

rickyyx reviewed Dec 26, 2024

mergify bot added the needs-rebase label Dec 26, 2024

robertgshaw2-neuralmagic closed this Dec 31, 2024

robertgshaw2-neuralmagic deleted the remove-async-stream branch December 31, 2024 15:33
